This lecture is about the feedback
in the language modeling approach.
In this lecture we will continue the
discussion of feedback in text retrieval.
In particular we're going to talk about
the feedback in language modeling
approaches.
So we derived the query likelihood ranking
function by making various assumptions.
As a basic retrieval function, that
formula, or those formulas, worked well.
But if we think about the feedback
information, it's a little bit awkward to
use query likelihood to
perform feedback because
a lot of times the feedback information is
additional information about the query.
But in the query likelihood method we
assume the query is generated by sampling
words from a language model.
It's kind of unnatural to sample
words that form feedback documents.
As a result, researchers proposed
a way to generalize the query
likelihood function.
It's called the Kullback-Leibler
divergence retrieval model.
And this model is actually
going to make the query likelihood
retrieval function much
closer to the vector space model.
Yet this form of the language model can
be regarded as a generalization of query
likelihood, in the sense that it can
cover query likelihood as a special case.
And in this case the feedback
can be achieved through
simply query model estimation or updating.
This is very similar to Rocchio
which updates the query vector.
So let's see what the KL-divergence
retrieval model is.
So, on the top, what you see is query
likelihood retrieval function,
all right, this one.
And then the KL-divergence, or
cross entropy, retrieval
model basically
generalizes the frequency part
here into a language model.
So basically the difference is that we
use a probabilistic model here
to characterize what the user is looking
for, versus the count of query words there.
And this difference allows us to plug in
various different ways to estimate this.
So this can be estimated in many different
ways including using feedback information.
Now this is called a KL-divergence because
this can be interpreted as measuring
the KL-divergence of two distributions.
One is the query model
denoted by this distribution.
The other is the document
language model here.
And this document model is a smoothed
language model, of course.
And we are not going to talk
about the details of that;
you'll find them in the references.
It's also called cross entropy
because, in fact,
we can ignore some terms in the
KL-divergence function and we will end up
with the cross entropy.
Both are terms from information theory.
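To make the "ignore some terms" step concrete, here is a short derivation sketch. The symbols are assumptions for illustration: theta_Q is the query model, theta_D the smoothed document model, and V the vocabulary.

```latex
-D(\theta_Q \,\|\, \theta_D)
  = -\sum_{w \in V} p(w \mid \theta_Q)\,
      \log \frac{p(w \mid \theta_Q)}{p(w \mid \theta_D)}
  = \sum_{w \in V} p(w \mid \theta_Q) \log p(w \mid \theta_D)
    \;-\; \sum_{w \in V} p(w \mid \theta_Q) \log p(w \mid \theta_Q)
```

The second sum depends only on the query, so it is the same for every document and does not affect the ranking. What remains is the negative cross entropy between the query model and the document model.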
But anyway, for
our purposes here, you can just see that
the two formulas look almost identical,
except that here we have the probability of
a word given by a query language model.
And here,
the sum is over all the words
that are in the document
and that also have non-zero probability in
the query model.
So it's kind of, again, a generalization
of sum over all the matching query words.
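The slide formulas are not part of this transcript. As a hedged sketch, the two functions are commonly written as follows for a smoothed document model. The notation is an assumption: c(w,q) is the count of w in the query, n is the query length, p_seen(w|d) is the smoothed probability of a word seen in the document, alpha_d is the document-dependent smoothing coefficient, and p(w|C) is the collection language model.

```latex
% Query likelihood
f(q, d) = \sum_{w \in d,\; w \in q} c(w, q)\,
          \log \frac{p_{\text{seen}}(w \mid d)}{\alpha_d\, p(w \mid C)}
          \;+\; n \log \alpha_d

% KL-divergence (cross entropy) retrieval
f(q, d) = \sum_{w \in d,\; p(w \mid \theta_Q) > 0} p(w \mid \theta_Q)\,
          \log \frac{p_{\text{seen}}(w \mid d)}{\alpha_d\, p(w \mid C)}
          \;+\; \log \alpha_d
```

The only change is that the count c(w,q) is replaced by the query model probability p(w|theta_Q).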
Now it's also easy to see that
we can recover the query likelihood
function here simply
by setting this query model to
the relative frequency of
a word in the query, right?
This is very easy to see
once you plug this in here.
Then you can eliminate the
query length, which is a constant,
and you get exactly that.
So you can see the equivalence.
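The equivalence can also be checked numerically. The following is a minimal sketch, not the lecture's implementation. It assumes Dirichlet smoothing for the document model and uses the plain form of the score (a sum over all words with non-zero query model probability) rather than the rewritten form that sums only over words in the document. All names and the toy collection are hypothetical.

```python
import math
from collections import Counter

def smoothed_doc_model(doc_tokens, collection_prob, mu):
    """Return p(w | d) with Dirichlet smoothing (the smoothing method is an assumption)."""
    counts = Counter(doc_tokens)
    n = len(doc_tokens)

    def p(w):
        return (counts[w] + mu * collection_prob[w]) / (n + mu)

    return p

def query_likelihood(query_tokens, p_doc):
    """log p(q | d) = sum over query words of c(w, q) * log p(w | d)."""
    return sum(c * math.log(p_doc(w)) for w, c in Counter(query_tokens).items())

def cross_entropy_score(query_model, p_doc):
    """Sum over words of p(w | theta_Q) * log p(w | d)."""
    return sum(pq * math.log(p_doc(w)) for w, pq in query_model.items() if pq > 0)

# Toy collection model; probabilities are made up and sum to 1.
collection_prob = {"airport": 0.01, "security": 0.02, "the": 0.5,
                   "bomb": 0.005, "news": 0.465}
doc = ["the", "airport", "security", "the", "bomb"]
query = ["airport", "security", "airport"]

p_doc = smoothed_doc_model(doc, collection_prob, mu=10.0)

# Query model set to the relative frequency of each word in the query.
query_model = {w: c / len(query) for w, c in Counter(query).items()}

ql = query_likelihood(query, p_doc)
ce = cross_entropy_score(query_model, p_doc)

# The two scores differ only by the constant factor 1 / |q|.
print(ql, ce * len(query))
```

Because the query length is the same for every document, dividing by it does not change the ranking.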
And that's also why this KL-divergence
model can be regarded as a generalization
of query likelihood because we can cover
query likelihood as a special case,
but it also allows us
to do much more than that.
So this is how we use the KL-divergence
model to then do feedback.
The picture shows that we first
estimate a document language model,
then we estimate a query
language model and
we compute the KL-divergence,
this is often denoted by a D here.
But this basically means
this is exactly like in the vector space
model, because we compute a vector for
the document, compute
another vector for the query,
and then we compute the distance.
Only these vectors
are of special forms:
they are probability distributions.
And then we get the results, and
we can find some feedback documents.
Let's assume they are
mostly positive documents.
Although we could also consider
both kinds of documents.
So what we could do is, like in Rocchio,
we can compute another language model,
called the feedback language model, here.
Again, this is going to be another vector,
just like computing the centroid vector in
Rocchio.
And then this model can be
combined with the original
query model using a linear interpolation.
And this would then give us an updated
model, just like again in Rocchio.
Right, so here we can see the parameter
alpha controlling the amount of feedback.
If it's set to 0,
then there is essentially no feedback.
If it's set to 1, we get full feedback
and ignore the original query.
And this is generally not desirable,
unless you are absolutely sure you
have seen a lot of relevant documents and
the query terms are not important.
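The interpolation step can be sketched in a few lines. This is a minimal illustration, assuming both models are stored as word-to-probability dictionaries. The function name and the toy probabilities are hypothetical.

```python
def interpolate(query_model, feedback_model, alpha):
    """Return (1 - alpha) * theta_Q + alpha * theta_F over the union vocabulary."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    vocab = set(query_model) | set(feedback_model)
    return {
        w: (1 - alpha) * query_model.get(w, 0.0) + alpha * feedback_model.get(w, 0.0)
        for w in vocab
    }

# Made-up models for the query "airport security".
query_model = {"airport": 0.5, "security": 0.5}
feedback_model = {"airport": 0.3, "security": 0.3, "bomb": 0.2, "terrorist": 0.2}

updated = interpolate(query_model, feedback_model, alpha=0.5)
no_feedback = interpolate(query_model, feedback_model, alpha=0.0)
full_feedback = interpolate(query_model, feedback_model, alpha=1.0)
```

With alpha set to 0 the feedback words get zero probability, and with alpha set to 1 the result is just the feedback model.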
So of course the main question here
is how do you compute this theta F?
This is the big question here.
And once you can do that,
the rest is easy.
So here we'll talk about
one of the approaches.
And there are many approaches of course.
This approach is based on generative model
and I'm going to show you how it works.
This uses a generative mixture model.
So this picture shows that
we have this model here,
the feedback model that
we want to estimate.
And the basis is the feedback documents.
Let's say we are observing
the positive documents.
These are the documents clicked by
users, or relevant documents judged by
users, or simply top-ranked documents
that we assume to be relevant.
Now imagine how we can
compute a centroid for
these documents by using language model.
One approach is simply to assume
these documents are generated from
this language model as we did before.
What we could do is
just normalize the word frequency here.
And then
we'll get this word distribution.
Now the question is whether this
distribution is good for feedback.
Well, you can imagine: the
top-ranked words would be what?
What do you think?
Well, those words would be common words,
right?
As we have seen, in a language model
the top-ranked words are actually
common words like "the", et cetera.
So it's not very good for feedback,
because we will be adding a lot of such
words to our query when we interpolate
this with the original query model.
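A toy sketch makes the problem visible. It simply normalizes word counts over a made-up pair of feedback documents. The documents and the function name are hypothetical.

```python
from collections import Counter

def mle_unigram(docs):
    """Maximum likelihood unigram model: normalized word counts over all docs."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Two tiny made-up feedback documents, already tokenized.
feedback_docs = [
    "the airport security the check the".split(),
    "the bomb at the airport".split(),
]

theta = mle_unigram(feedback_docs)
top_word = max(theta, key=theta.get)
print(top_word, theta[top_word])
```

Even in this tiny example the most probable word is "the", which carries no information about the topic.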
So, this is not good, so
we need to do something, in particular,
we are trying to get rid
of those common words.
And we have actually seen one way
to do that, by using a background language
model in the case of learning
the associations of words, right?
For example, the words that are related
to the word "computer".
We could do that, and
that would be another way to do this.
But here, we're going to
talk about another approach,
which is a more principled approach.
In this case, we're going to say, well,
we see that there are common words
here in these documents that should
not belong to this topic model, right?
So now, what we can do is to assume that
those words are generated
from the background language model,
so it will generate
those words, like "the", for example.
And if we use the maximum
likelihood estimator,
note that if all the words here
must be generated from this model,
then this model is forced to assign
high probabilities to a word like "the",
because it occurs so frequently here.
Note that in order to reduce its
probability in this model, we have to
have another model, which is this one,
to help explain the word "the" here.
And in this case,
it's appropriate to use the background
language model to achieve this goal,
because this model will assign high
probabilities to these common words.
So in this approach, we assume
the machine that generated
these words works as follows.
We have a source controller here.
Imagine we flip a coin here to
decide what distribution to use.
With probability lambda
the coin shows up as heads,
and then we're going to use
the background language model,
and we then sample
a word from that model.
With probability 1 minus lambda,
we decide to use an unknown topic
model here that we will try to estimate,
and we're going to then
generate a word here.
If we make this assumption, then this
whole thing will be just one model, and
we call this a mixture model,
because there are two distributions
that are mixed here together.
And we actually don't know when
each distribution is used.
Right, so again think of this
whole thing as one model.
And we can still ask it for words, and
it will still give us a word
in a random manner, right?
And of course which word will show up
will depend on both this distribution and
that distribution.
In addition,
it would also depend on this lambda,
because if you say
lambda is very high, then
it's going to almost always use the
background distribution, and
you'll get different words.
If you say, well, our lambda is very small,
we're going to mostly use
this topic model, all right?
So all these are parameters,
in this model.
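The generative process just described can be sketched as a small sampler. This is only an illustration. The two distributions, the function name, and the lambda values are made up.

```python
import random

def sample_word(topic_model, background_model, lam, rng):
    """Flip a biased coin, then sample one word from the chosen distribution."""
    # With probability lam use the background model, otherwise the topic model.
    model = background_model if rng.random() < lam else topic_model
    words = list(model)
    weights = [model[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

# Made-up distributions; each sums to 1.
background = {"the": 0.7, "of": 0.2, "airport": 0.05, "security": 0.05}
topic = {"airport": 0.4, "security": 0.4, "bomb": 0.2}

rng = random.Random(0)  # fixed seed so runs are repeatable
mostly_background = [sample_word(topic, background, 0.9, rng) for _ in range(1000)]
mostly_topic = [sample_word(topic, background, 0.1, rng) for _ in range(1000)]

print(mostly_background.count("the"), mostly_topic.count("the"))
```

With a high lambda most samples are common words like "the". With a low lambda most samples come from the topic distribution.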
And then, if you're thinking this way,
basically we can do exactly the same as
what we did before, we're going to use
maximum likelihood estimator to adjust
this model to estimate the parameters.
Basically we're going to adjust,
well, this parameter so
that we can best explain all the data.
The difference now is that we are not
asking this model alone to explain this.
But rather we're going to ask
this whole model, mixture model,
to explain the data because it has got
some help from the background model.
So it doesn't have to assign high
probabilities to words like "the".
As a result,
it would then assign high probabilities
to other words that are common here but
do not have high probability here.
So those would be common here.
Right?
And if they're common they would
have to have high probabilities,
according to a maximum
likelihood estimator.
And if they are rare in the background,
then you don't get much help
from this background model.
As a result, this topic model
must assign high probabilities to them.
So the higher probability words
according to the topic model
will be those that are common here,
but rare in the background.
Okay, so this is basically a little
bit like the idea of IDF weighting here.
This would allow us to achieve
the effect of removing these stop words
that are meaningless in the feedback.
So mathematically, what we have is
to compute the likelihood, or rather
the log-likelihood, of
the feedback documents.
And note that we also have
another parameter, lambda, here.
But we assume that lambda denotes
the noise in the feedback documents.
So we are going to
set this to a fixed value; let's
say 50% of the words are noise,
or 90% are noise.
So this can then be
assumed to be fixed.
If we assume this is fixed, then we only
have these probabilities as parameters
just like in the simplest unigram
language model, we have n parameters.
n is the number of words. And then the
likelihood function will look like this.
It's very similar to the
normal likelihood
function we saw before, except that inside
the logarithm there's a sum here.
And this sum is because we
consider the two distributions.
And which one is used depends on
lambda, and that's why we have this form.
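The slide formula is not part of this transcript. As a sketch consistent with the description above, the log-likelihood can be written as follows. The notation is an assumption: F is the set of k feedback documents d_1 to d_k, c(w, d_i) is the count of word w in d_i, V is the vocabulary, and p(w|C) is the background model.

```latex
\log p(F \mid \theta)
  = \sum_{i=1}^{k} \sum_{w \in V} c(w, d_i)\,
    \log \Big[ (1 - \lambda)\, p(w \mid \theta) + \lambda\, p(w \mid C) \Big]
```

The sum inside the logarithm covers the two ways a word can be generated: from the topic model with probability 1 minus lambda, or from the background model with probability lambda.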
But mathematically this is the function
with theta as unknown variables, right?
So, this is just a function.
All the other variables are known,
except for this guy.
So, we can then choose this
probability distribution to
maximize this log likelihood.
The same idea as the maximum
likelihood estimator.
As a mathematical problem,
we just have to solve this
optimization problem.
Conceptually, we would try all
of the theta values and
find the one that gives this
whole thing the maximum probability.
So, it's a well-defined math problem.
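This lecture does not say how the optimization is actually solved. One standard choice for mixture models of this kind is the EM algorithm, and the sketch below uses it as an assumption rather than as the lecture's stated method. The function name, the toy documents, and the background probabilities are all hypothetical. It also assumes the background model gives non-zero probability to every word in the feedback documents.

```python
from collections import Counter

def estimate_feedback_model(feedback_docs, background, lam, iterations=50):
    """Estimate the topic model of a two-component mixture with fixed lam via EM."""
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must be in [0, 1)")
    counts = Counter(w for doc in feedback_docs for w in doc)
    # Start from a uniform topic model over the observed words.
    theta = {w: 1.0 / len(counts) for w in counts}
    for _ in range(iterations):
        # E-step: probability that an occurrence of w came from the topic model.
        z = {}
        for w in counts:
            topic_part = (1 - lam) * theta[w]
            z[w] = topic_part / (topic_part + lam * background[w])
        # M-step: re-estimate theta from the expected topic counts.
        expected = {w: counts[w] * z[w] for w in counts}
        total = sum(expected.values())
        theta = {w: e / total for w, e in expected.items()}
    return theta

# Made-up feedback documents and background model (sums to 1).
feedback_docs = [
    "the airport security the check the".split(),
    "the bomb at the airport".split(),
]
background = {"the": 0.5, "at": 0.2, "airport": 0.05,
              "security": 0.05, "check": 0.1, "bomb": 0.1}

theta_low = estimate_feedback_model(feedback_docs, background, lam=0.5)
theta_high = estimate_feedback_model(feedback_docs, background, lam=0.9)
print(sorted(theta_high.items(), key=lambda kv: -kv[1])[:3])
```

In this toy run the probability of "the" drops below its simple normalized frequency of 5/11, and it drops further when lambda is larger, while "airport" gains probability.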
Once we have done that,
we obtain this theta F,
which can be interpolated with
the original query model to do feedback.
So here are some examples of
the feedback model learned from a web
document collection, and
we do pseudo-feedback.
We just use the top 10 documents,
and we use this mixture model.
So the query is airport security.
What we do is we first retrieve ten
documents from the web database.
And this is of course pseudo-feedback,
right?
And then we're going to fit that
mixture model to this ten-document set.
And these are the words
learned using this approach.
This is the probability of a word given
by the feedback model in both cases.
So, in both cases, you can see
the highest-probability words
include words very relevant to the query.
So, airport security for example,
these query words still show
up as high probabilities
in each case naturally because they occur
frequently in the top rank of documents.
But we also see beverage, alcohol,
bomb, terrorist, et cetera.
Right, so these are relevant
to this topic, and
if combined with the original query
they can help
us match documents more accurately.
And also they can help us bring up
documents that only mention
some of these other words,
maybe, for example, just airport and
then bomb.
So
this is how pseudo-feedback works.
It shows that this model really works and
picks up
some words related to the query.
What's also interesting is that if
you look at the two tables here and
compare them, you see that in this
case, when lambda is set to a small value,
we still see some common
words here. That means
we don't use the background
model often; remember, lambda
controls the probability of using the
background model to generate the text.
If we don't rely much on the
background model,
we still have to use this topic model
to account for the common words.
Whereas if we set lambda to a very
high value, we would use the background
model very often to explain these words,
and then there is no burden on
the topic model to explain those common
words in the feedback documents.
So, as a result, the topic model
here is very discriminative.
It contains all the relevant
words without common words.
So this can be added to the original
query to achieve feedback.
So to summarize, in this lecture we
have talked about feedback in the
language modeling approach.
In general,
feedback is to learn from examples.
These examples can be assumed examples, or
pseudo-examples,
like the top ten
documents that are assumed to be relevant.
They could also be based on user
interactions, like feedback
based on clickthroughs, or implicit feedback.
We talked about the three major
feedback scenarios, relevance feedback,
pseudo-feedback, and implicit feedback.
We talked about how to use Rocchio to
do feedback in vector-space model and
how to use query model estimation for
feedback in language model.
And we briefly talked about
the mixture model and
the basic idea and
there are many other methods.
For example the relevance model
is a very effective model for
estimating query model.
So, you can read more about
these methods in the references that
are listed at the end of this lecture.
So there are two additional readings here.
The first one is a book that
has a systematic review and
discussion of language models
for information retrieval.
And the second one is an important
research paper about relevance-based
language models, which are a very
effective way of computing the query model.

